************************;
* Contents of directory ;
************************;

Title    : The Consequences of Industrialization: 
           Evidence from Water Pollution and Digestive Cancers in China
Date     : July 2010
Purpose  : In this readme file, I trace out the paths of each
	   of the data sets for the project, and the programs
	   underlying each of the empirical results. These files are 
	   sufficient for replication, but you would have to download 
	   the data and change the paths to fit your local machine. 
Note     : To view the contents of the directory, delete readme.txt
	   from the URL in your browser's address bar and hit
	   return. That allows you to access the files.
 
	   Enjoy!

FILES found in ~/research/pollution/ (symbolically linked with "pollution")

**************************;
* Data directories        ;
**************************;

1. GIS. The GIS directory is where most of the data are stored. The subdirectories
are:

a. water_points. This contains the 484 water quality monitoring points
assigned to river basins. See Pollutants_Counties_watersheds.dbf.

b. air_pollution. This contains the measures of long-term particulates
by level 6 river basin. See longterm_particulatesbybasin.csv.

c. rainfall. This contains precipitation rates by river basin. See
precipitation_by_basinlevel6.csv.

d. stream_lengths. This is the stream data, which identifies how far
each stream segment is from the headwaters/outlet.

e. stream_nodes. This is the data containing the points of
intersection of the streams.

f. Hydro1k. This is the base data from the Hydro1k project with China
divided into river basins. The key file here is
china_hydro1k_pfafcoded.csv, which is converted into hydro1k_pfaf.dta,
which has for each river basin the upstream/downstream basin at the
level3 through level6 aggregations. Note that currently the upstream
manufacturing instrument is based on the level4 basin upstream of each
DSP point. The data in hydro1k_pfaf.dta has 1,709 points,
corresponding to the river basins in the pan-Asia region.

g. Dissolved_Watersheds. This is the dissolved Hydro1k data for the
purpose of analyzing river basins at higher levels of aggregation
than the level 6 basins.

h. county_output. This is the county output point data (provided by Jing Cao
with a latitude/longitude marker) assigned to river basins. This is
how I figure out how much manufacturing output occurs in each
basin. The key file I later use is
Manufacturing_Points_wHydro1k_FullJoin_NoBad.csv, which has each of
the 3,470 points assigned to a river basin, as well as the
manufacturing output by industry recorded at each point.

i. county_points.  This is the original Harvard Geospatial Library copy
of the census, with each county assigned to the river basin in which
it is located, where the center of the county (the "centroid") is
chosen as the location of the county. Note that since there are 2,873
counties and only 989 river basins, the choice of centroid is usually
not critical. See countycentroids_watersheds.dbf/dta, which has 2,873
points and is how I assign each DSP point to a particular location,
using a correspondence between the DSP location and the county
(gbcode). See the DSP directory for an explanation of how this is
executed (gbcode.do, dsplabels.xls). I also assign each county to its
nearest water point and nearest stream.

j. death_points. This is just a spatial join of each of the 145 DSP
points (using the county centroid for the point) with the closest
water monitoring station. Not used.
 
k. fertilizer. This is provincial fertilizer use in terms of tonnage
of nitrogen and phosphate. See 2004fertilizer_clean.csv, which I am
not currently using.

l. basemap. This is the slightly tweaked copy of China's 2000 census
shapefile provided by the Harvard Geospatial Library (HGL). See
ch2000longfinal.shp.
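
A minimal sketch of how the pieces in items f and h could fit together:
manufacturing output is summed over the points assigned to each basin,
and the total for the basin immediately upstream is then looked up. The
function names, the (basin, output) pairs, and the upstream_of mapping
are hypothetical, since the column names in hydro1k_pfaf.dta and
Manufacturing_Points_wHydro1k_FullJoin_NoBad.csv are not listed in this
readme. The instrument actually used in the paper is built by the
programs in the dofiles directory and may differ.

```python
from collections import defaultdict


def output_by_basin(points):
    """Sum output over all (basin_id, output) pairs, one total per basin."""
    totals = defaultdict(float)
    for basin_id, output in points:
        totals[basin_id] += output
    return dict(totals)


def upstream_output(basin_id, upstream_of, totals):
    """Return the output total of the basin directly upstream of basin_id.

    upstream_of maps a basin to its upstream basin (an assumed layout).
    Returns None when no upstream basin is recorded, and 0.0 when the
    upstream basin exists but has no output points.
    """
    upstream = upstream_of.get(basin_id)
    if upstream is None:
        return None
    return totals.get(upstream, 0.0)


# Toy data: two points in basin "A", one in "B"; "A" drains into "B",
# and "B" drains into "C".
points = [("A", 10.0), ("A", 5.0), ("B", 2.5)]
upstream_of = {"B": "A", "C": "B"}
totals = output_by_basin(points)
```

With the toy data, upstream_output("B", upstream_of, totals) gives the
total for basin "A", while a headwater basin such as "A" returns None.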

*****************************;
* Executable files directory ;
*****************************;

2. dofiles. These are the executable files that make the data, tables,
and figures for the paper. An extensive readme file is in this directory.

*********************************************************;
* Intermediate data files used in the analysis directory ;
*********************************************************;

3. datafiles. This is the data for the paper. The key data set here is
dsp_basins.dta. That has the 145 DSP points and all the other
information assigned to them, sufficient to recreate almost the entire
paper.  See also levies_dumping_output_1992to2002.dta, which is used
for Table 9. The basin-level data is also included in this directory
for the water quality-dumping regressions.

4. dumping. This directory contains the .csv files for the water
dumping data by province and year (e.g. 2005data.csv), and the STATA 
data sets created from these files. This also has the dumping by
industry data. It is a direct subdirectory of the program directory 
because it is parallel on my machine and on Ali's PC. The integration 
of the dumping data is executed from the dofiles directory but the
data are saved here as levies_dumping_output_1992to2002.dta.

5. industries. This directory contains the industrial data, which is
mostly not used in the paper. It is here as a historical artifact.
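
A minimal sketch of stacking the yearly dumping files from item 4 into
one list of records with a year field. It assumes every file follows the
<year>data.csv pattern of the 2005data.csv example and has a header row.
The function name and the "year" key are hypothetical. The actual
integration is done by the Stata programs in the dofiles directory.

```python
import csv
import re
from pathlib import Path


def stack_yearly_csvs(folder):
    """Read every <year>data.csv in folder and tag each row with its year."""
    rows = []
    for path in sorted(Path(folder).glob("*data.csv")):
        # Keep only names made of a four-digit year followed by "data.csv".
        match = re.fullmatch(r"(\d{4})data\.csv", path.name)
        if match is None:
            continue
        year = int(match.group(1))
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                row["year"] = year
                rows.append(row)
    return rows
```

Calling stack_yearly_csvs("dumping") from the project root would return
one dictionary per province-year row, ordered by file name.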

**************************************************************;
* Data files that I need to bring to Windows (my home machine);
**************************************************************;

6. outfiles. Here is where I outsheet the regressions, and the data
for making maps.

************************************;
* Literature review for the project ;
************************************;

7. references. These are papers I directly cite in the paper, in pdf
form. In some cases, my citations were newspaper articles (with
weblinks) or books (which have neither).

8. litreview. These are files that are relevant to the project, which
may or may not be directly cited.

************************************;
* Data Appendix                     ;
************************************;

9. data_appendix. This contains a cleaner version of the files that I
reference throughout the data appendix in the text. This is the
easiest way for an average user to exploit the data in the paper.

************************************;
* Powerpoints                       ;
************************************;

10. powerpoints. My slides from presentations of this material.

************************************;
* medical                           ;
************************************;

11. medical. A directory devoted to the medical lit review materials.

************************************;
* logfiles                          ;
************************************;

12. logfiles. A directory of STATA logfiles.
 
********************************************************************;
* Note: see the dofiles directory to account for the tables/figures ;
********************************************************************;
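
A minimal sketch for checking a local copy after changing the paths: it
lists which of the twelve directories named in this readme are missing
under a chosen root. It assumes all twelve sit directly under the root,
as the listing above suggests. The function name and the EXPECTED_DIRS
constant are hypothetical and not part of the project files.

```python
from pathlib import Path

# Directory names taken from items 1 through 12 of this readme.
EXPECTED_DIRS = [
    "GIS", "dofiles", "datafiles", "dumping", "industries", "outfiles",
    "references", "litreview", "data_appendix", "powerpoints",
    "medical", "logfiles",
]


def missing_dirs(root):
    """Return the expected directory names that do not exist under root."""
    root = Path(root).expanduser()
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Replace the root with the path on your local machine.
    for name in missing_dirs("~/research/pollution"):
        print("missing:", name)
```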
